Specify map location when loading model #272
Conversation
Codecov Report

@@           Coverage Diff           @@
##           master     #272   +/-   ##
=======================================
  Coverage    87.6%    87.6%
=======================================
  Files          26       26
  Lines        2178     2178
=======================================
  Hits         1908     1908
  Misses        270      270
Tests fail for me on a GPU machine with these logs:

===================================================================== FAILURES ======================================================================
_________________________________________________________ test_not_use_default_model_labels _________________________________________________________
dummy_trained_model_checkpoint = PosixPath('/tmp/pytest-of-bull/pytest-0/dummy-model-dir3/my_model/version_0/dummy_model.ckpt')
def test_not_use_default_model_labels(dummy_trained_model_checkpoint):
"""Tests that training a model using labels that are a subset of the model species but
with use_default_model_labels=False replaces the model head."""
original_model = DummyZambaVideoClassificationLightningModule.from_disk(
dummy_trained_model_checkpoint
)
model = instantiate_model(
checkpoint=dummy_trained_model_checkpoint,
scheduler_config="default",
labels=pd.DataFrame([{"filepath": "gorilla.mp4", "species_gorilla": 1}]),
use_default_model_labels=False,
)
> assert (model.head.weight != original_model.head.weight).all()
E RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Actually, I checked with pdb, and here is the device each model is on. It looks like the newly instantiated model is on the CPU while the original model is on the GPU:

(Pdb) model.device
device(type='cpu')
(Pdb) original_model.device
device(type='cuda', index=0)
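Below is a minimal sketch (not from the PR; shapes and names are made up) of why the assertion above raises: an elementwise comparison between a CPU tensor and a CUDA tensor fails with exactly this RuntimeError, and moving the operands onto a common device is what makes the comparison valid.

```python
import torch

cpu_weights = torch.randn(2, 3)                      # stands in for model.head.weight on cpu
if torch.cuda.is_available():
    gpu_weights = torch.randn(2, 3, device="cuda")   # stands in for original_model.head.weight on cuda:0
    try:
        (cpu_weights != gpu_weights).all()           # raises: operands live on different devices
    except RuntimeError as err:
        print(err)                                   # "Expected all tensors to be on the same device, ..."
    # Once both operands are on one device, the elementwise comparison works as intended.
    print((cpu_weights != gpu_weights.cpu()).all())
```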
@@ -110,10 +110,8 @@ def instantiate_model(
         return resume_training(
             scheduler_config=scheduler_config,
             hparams=hparams,
-            species=species,
[cleanup] removed unused params
@pjbull I learned a couple of useful things while debugging this: 1) A LightningModule already keeps track of whether a GPU is available. 2) We need to specify a map location when loading from a checkpoint.
Our options are to:
I've gone ahead and made some simplifications.
Tests are passing locally in a fresh Python 3.9 environment on both GPU and non-GPU machines.
It still seems a little weird to force cpu in all the from_disk calls, and it seems preferable to avoid that if possible. If not, the rest of the change looks great to me!
-        return cls.load_from_checkpoint(path)
+    def from_disk(cls, path: os.PathLike, **kwargs):
+        # note: we always load models onto CPU; moving to GPU is handled by `devices` in pl.Trainer
+        return cls.load_from_checkpoint(path, map_location="cpu", **kwargs)
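A rough sketch of the intent behind this change (class and checkpoint names below are hypothetical, not from this repo): the checkpoint is always deserialized onto CPU, and the Trainer's accelerator/devices settings are what move the module onto the GPU.

```python
import pytorch_lightning as pl

def load_checkpoint_on_cpu(model_cls: type, path: str) -> pl.LightningModule:
    # Mirrors the change above: always deserialize the weights onto CPU.
    return model_cls.load_from_checkpoint(path, map_location="cpu")

# Hypothetical usage: the Trainer, not from_disk, decides where the model runs.
# model = load_checkpoint_on_cpu(MyLightningModule, "my_model.ckpt")
# trainer = pl.Trainer(accelerator="auto", devices=1)
# trainer.fit(model)  # the Trainer moves the module to the GPU (if any) before training starts
```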
Maybe this is fine, but forcing cpu here still feels a little weird to me. What if someone isn't using the trainer? Then they have to know and move devices themselves?
Just wondering: would letting things happen automatically be a problem?
> Just wondering: would letting things happen automatically be a problem?
With the recent torch changes, we need to specify a map location when loading from a checkpoint (map_location can no longer be None; that is the source of the failing tests). Trying to figure out at this point in the code whether the user has a GPU and wants to use it just adds redundancy and complexity without helping (it's hard to summarize succinctly here, but that is the route we tried first and it was not a good option).
> What if someone isn't using the trainer?

Whether they're using our train_model code or writing their own PTL code, they'll still be using the trainer. If for some reason they're using our code to load the model but then not using PTL, then yes, they'll have to know about devices and move things themselves.
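For that last case, a hedged sketch (hypothetical names, not part of this PR) of what a caller who loads the model with our code but skips PTL entirely would need to do:

```python
import torch

def move_to_available_device(model: torch.nn.Module) -> torch.nn.Module:
    # Without a pl.Trainer, device placement is the caller's responsibility.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return model.to(device)

# Hypothetical inference without a Trainer:
# model = move_to_available_device(MyModel.from_disk("my_model.ckpt")).eval()
# with torch.no_grad():
#     preds = model(batch.to(next(model.parameters()).device))
```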
Fixes #270
Fixes #271